I have a typical project: predicting NYC Uber/Lyft trip demand. The dataset spans January 2022 to March 2023, the city is already divided into location zones, and I want the predicted demand for each location every 15 minutes.
The goal of this project is to predict demand for Uber/Lyft trips in each NYC location every 15 minutes, using a dataset spanning January 2022 to March 2023. Each record includes the dispatching base number, pickup datetime, drop-off datetime, pickup location ID, drop-off location ID, SR_Flag, and affiliated base number.
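As a concrete picture of the desired output, the sketch below builds the 15-minute demand series for a single location from a hypothetical four-trip sample (the `pickup_datetime` and `PUlocationID` column names follow the FHV schema used throughout):

```python
import pandas as pd

# Hypothetical mini-sample of FHV trip records.
trips = pd.DataFrame({
    'pickup_datetime': pd.to_datetime([
        '2022-01-01 00:03', '2022-01-01 00:07',
        '2022-01-01 00:21', '2022-01-01 00:40',
    ]),
    'PUlocationID': [12.0, 12.0, 12.0, 12.0],
})

# Demand = number of pickups per location per 15-minute bin.
demand = (trips.set_index('pickup_datetime')
               .groupby('PUlocationID')
               .resample('15min')
               .size())
print(demand)
```

The result is a series indexed by (location ID, 15-minute bin), which is exactly the shape the forecasting step needs.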
import glob
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import tqdm
from dateutil.relativedelta import relativedelta
from pmdarima import auto_arima
from statsmodels.tsa.arima.model import ARIMA
# Uses the glob.glob function to retrieve a list of file paths that match the specified
# pattern 'Datasets/fhv_tripdata_2022-2023_in_csv/*.csv'.
# This pattern is used to find all CSV files in the given directory.
data_list_path = glob.glob('Datasets/fhv_tripdata_2022-2023_in_csv/*.csv')
# Initializes an empty list called list_df to store the DataFrames.
list_df = []
# Iterates over each file path in data_list_path
for path in data_list_path:
    print(path)
    # Step 1: Preprocess the Dataset
    # Inside the loop, reads each CSV file using pd.read_csv and assigns it to df.
    df = pd.read_csv(path)
    # Appends the DataFrame to list_df.
    list_df.append(df)
# After the loop, it concatenates all the DataFrames in list_df into a single DataFrame using pd.concat.
# The concatenated DataFrame is assigned to the variable df
df = pd.concat(list_df)
# Specifies a list of column names ('pickup_datetime' and 'PUlocationID')
# in interested_features that you are interested in keeping
interested_features = ['pickup_datetime','PUlocationID']
# Updates df to contain only the columns specified in interested_features using indexing.
df = df[interested_features]
# Summary :
# Overall, this code reads multiple CSV files from the specified directory,
# concatenates them into a single DataFrame, and then selects and keeps only the columns specified in interested_features
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-09.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-02.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-04.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-07.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-01.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-06.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-08.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2023-03.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-11.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-12.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2023-02.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-03.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2023-01.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-05.csv
Datasets/fhv_tripdata_2022-2023_in_csv/fhv_tripdata_2022-10.csv
# The code imports the necessary libraries:
import pandas as pd
import pmdarima as pm
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
# Prints the number of rows in the DataFrame df before removing rows with NaN values
# This line uses the .shape[0] attribute of a DataFrame to retrieve the number of rows.
print('Number of Rows Before Removing NaN:', df.shape[0])
# Removes rows with NaN values from the DataFrame df and assigns the result to removed_nan_df:
removed_nan_df = df.dropna()
# The .dropna() method removes rows containing any NaN values.
# The resulting DataFrame, with NaN rows removed, is assigned to removed_nan_df.
print('Number of Rows After Removing NaN:', removed_nan_df.shape[0])
Number of Rows Before Removing NaN: 17712727
Number of Rows After Removing NaN: 4164902
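The drop above discards roughly three-quarters of the rows. Before committing to a blanket `dropna()`, it is worth checking which column drives the loss; a sketch, using a tiny hypothetical frame in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one record is missing its pickup location.
df_demo = pd.DataFrame({
    'pickup_datetime': ['2022-01-01 00:00', '2022-01-01 00:15', '2022-01-01 00:30'],
    'PUlocationID': [12.0, np.nan, 89.0],
})
# Per-column NaN counts show where the loss comes from:
print(df_demo.isna().sum())
# Restricting the drop to the column that matters keeps every usable row:
kept = df_demo.dropna(subset=['PUlocationID'])
print('Rows kept:', len(kept))
```

On the real data both kept columns can carry NaNs, so `subset=` makes the trade-off explicit rather than implicit.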
# Retrieves unique values from the 'PUlocationID' column in the
# DataFrame removed_nan_df and converts them to a list:
location_ids = removed_nan_df['PUlocationID'].unique().tolist()
# Initializes a loop counter variable:
loop_count = 0
# Iterates over each unique location ID in location_ids:
for lc_id in location_ids:
    # Prints the current location ID:
    print('Location ID:', lc_id)
    # Filters removed_nan_df to a subset containing rows with this 'PUlocationID'.
    # .copy() gives an independent frame, avoiding the SettingWithCopyWarning
    # that the assignment on the next line would otherwise raise:
    df_subset = removed_nan_df[removed_nan_df['PUlocationID'] == lc_id].copy()
    # Converts the 'pickup_datetime' column to datetime format:
    df_subset['pickup_datetime'] = pd.to_datetime(df_subset['pickup_datetime'])
    # Sorts df_subset by 'pickup_datetime':
    df_subset = df_subset.sort_values('pickup_datetime')
    # Sets 'pickup_datetime' as the index:
    df_subset = df_subset.set_index('pickup_datetime')
    # Resamples at a 1-hour frequency ('1H') and counts trips per hour:
    df_subset = df_subset['PUlocationID'].resample('1H').count()
    # Resets the index so the time index becomes a column again:
    df_subset = df_subset.reset_index()

    # Split data into training and testing sets:
    # 95% of the series is used for training, the remaining 5% for testing.
    train_size = int(len(df_subset) * 0.95)
    train_data = df_subset[:train_size]
    test_data = df_subset[train_size:]

    # Perform auto ARIMA on the training data.
    # Note: seasonal=True has no effect unless the seasonal period m is also set
    # (e.g. m=24 for hourly data with a daily cycle); with the default m=1 the
    # search fits non-seasonal models only, as the (0,0,0)[0] seasonal terms in
    # the trace output confirm.
    model = pm.auto_arima(train_data['PUlocationID'], seasonal=True, trace=True)

    # Generates the forecast and its confidence interval over the test horizon:
    forecast, conf_int = model.predict(n_periods=len(test_data), return_conf_int=True)

    # Create a dataframe of predictions and actual values:
    result_df = pd.DataFrame({
        'pickup_datetime': test_data['pickup_datetime'],
        'Actual': test_data['PUlocationID'],
        'Forecast': forecast
    })
    # Save the dataframe to a CSV file:
    filename = f"arima-results/{lc_id}_data.csv"
    result_df.to_csv(filename, index=False)

    # Plotting: creates a Plotly figure and adds traces for the training data,
    # testing data, and ARIMA forecast:
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train_data.index, y=train_data['PUlocationID'], mode='lines+markers', name='Training Data'))
    fig.add_trace(go.Scatter(x=test_data.index, y=test_data['PUlocationID'], mode='lines+markers', name='Testing Data'))
    fig.add_trace(go.Scatter(x=test_data.index, y=forecast, mode='lines+markers', name='ARIMA Forecast'))
    # Updates the layout with a title and axis labels:
    fig.update_layout(title=f'Pickup Location ID: {lc_id}', xaxis_title='Time', yaxis_title='Number of Trips')
    # Displays the figure:
    fig.show()

    loop_count += 1
    # Stops after the first two location IDs for demonstration;
    # remove this check to process every location:
    if loop_count > 1:
        break
# Summary
# To run the analysis for every location, remove the loop-counter check above.
# Overall, this code performs time series analysis and forecasting per location ID,
# giving a view of the demand patterns and trends at each location.
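The loop above aggregates at an hourly frequency even though the stated goal is 15-minute demand; only the resample rule needs to change. A sketch with a hypothetical single-location frame (50 random pickup times within one hour) showing the '15min' variant:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for one location's trips: 50 pickups within one hour.
times = pd.to_datetime('2022-01-01') + pd.to_timedelta(rng.integers(0, 3600, 50), unit='s')
one_location = pd.DataFrame({'pickup_datetime': times, 'PUlocationID': 12.0})

# Same pipeline as in the loop, with '15min' in place of '1H':
series_15min = (one_location.set_index('pickup_datetime')
                            .sort_index()['PUlocationID']
                            .resample('15min')
                            .count())
print(series_15min)
```

Note that the 15-minute series is four times longer than the hourly one, so auto_arima fit times grow accordingly.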
Location ID: 12.0
Performing stepwise search to minimize aic
ARIMA(2,0,2)(0,0,0)[0] intercept : AIC=-11849.933, Time=18.91 sec
ARIMA(0,0,0)(0,0,0)[0] intercept : AIC=-11825.381, Time=4.77 sec
ARIMA(1,0,0)(0,0,0)[0] intercept : AIC=-11844.742, Time=10.76 sec
ARIMA(0,0,1)(0,0,0)[0] intercept : AIC=-11843.762, Time=4.85 sec
ARIMA(0,0,0)(0,0,0)[0]           : AIC=-11650.357, Time=0.41 sec
ARIMA(1,0,2)(0,0,0)[0] intercept : AIC=-11854.674, Time=35.79 sec
ARIMA(0,0,2)(0,0,0)[0] intercept : AIC=-11847.957, Time=7.66 sec
ARIMA(1,0,1)(0,0,0)[0] intercept : AIC=-11846.164, Time=6.36 sec
ARIMA(1,0,3)(0,0,0)[0] intercept : AIC=-11844.097, Time=18.47 sec
ARIMA(0,0,3)(0,0,0)[0] intercept : AIC=-11847.073, Time=9.55 sec
ARIMA(2,0,1)(0,0,0)[0] intercept : AIC=-11854.373, Time=21.05 sec
ARIMA(2,0,3)(0,0,0)[0] intercept : AIC=-11853.703, Time=107.06 sec
ARIMA(1,0,2)(0,0,0)[0]           : AIC=inf, Time=9.08 sec
Best model: ARIMA(1,0,2)(0,0,0)[0] intercept
Total fit time: 254.856 seconds
Location ID: 89.0
Performing stepwise search to minimize aic
ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=50398.076, Time=44.34 sec
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=51768.990, Time=1.06 sec
ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=50564.942, Time=2.41 sec
ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=50454.990, Time=3.87 sec
ARIMA(0,1,0)(0,0,0)[0]           : AIC=51766.991, Time=0.28 sec
ARIMA(1,1,2)(0,0,0)[0] intercept : AIC=inf, Time=50.57 sec
ARIMA(2,1,1)(0,0,0)[0] intercept : AIC=50447.109, Time=8.11 sec
ARIMA(3,1,2)(0,0,0)[0] intercept : AIC=50260.815, Time=33.35 sec
ARIMA(3,1,1)(0,0,0)[0] intercept : AIC=50442.066, Time=15.09 sec
ARIMA(4,1,2)(0,0,0)[0] intercept : AIC=50238.059, Time=57.19 sec
ARIMA(4,1,1)(0,0,0)[0] intercept : AIC=50429.196, Time=15.57 sec
ARIMA(5,1,2)(0,0,0)[0] intercept : AIC=50165.209, Time=65.94 sec
ARIMA(5,1,1)(0,0,0)[0] intercept : AIC=50401.097, Time=18.52 sec
ARIMA(5,1,3)(0,0,0)[0] intercept : AIC=inf, Time=88.14 sec
ARIMA(4,1,3)(0,0,0)[0] intercept : AIC=inf, Time=70.16 sec
ARIMA(5,1,2)(0,0,0)[0]           : AIC=50163.210, Time=11.05 sec
ARIMA(4,1,2)(0,0,0)[0]           : AIC=50236.059, Time=7.19 sec
ARIMA(5,1,1)(0,0,0)[0]           : AIC=50399.097, Time=4.14 sec
ARIMA(5,1,3)(0,0,0)[0]           : AIC=inf, Time=13.49 sec
ARIMA(4,1,1)(0,0,0)[0]           : AIC=50427.197, Time=1.92 sec
ARIMA(4,1,3)(0,0,0)[0]           : AIC=inf, Time=8.28 sec
Best model: ARIMA(5,1,2)(0,0,0)[0]
Total fit time: 520.790 seconds
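Once the per-location CSVs are written, each can be scored against the held-out actuals. A sketch with hypothetical values, using the same `Actual`/`Forecast` column names that `result_df` writes:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one saved arima-results CSV.
result_df = pd.DataFrame({
    'Actual':   [10, 12, 9, 11],
    'Forecast': [11, 10, 9, 13],
})
errors = result_df['Actual'] - result_df['Forecast']
mae = errors.abs().mean()                # mean absolute error
rmse = np.sqrt((errors ** 2).mean())     # root mean squared error
print(f'MAE={mae:.2f}  RMSE={rmse:.2f}')
```

Comparing MAE/RMSE across locations makes it easy to spot zones where the ARIMA fit is weak and a seasonal or richer model may be warranted.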